[Misc] Add 20 regression tests for 11 tool parser bug fixes #38172
chaunceyjiang merged 3 commits into vllm-project:main
Code Review
This pull request introduces a comprehensive suite of regression tests across multiple tool parsers, including DeepSeekV32, GLM4-MoE, KimiK2, MinimaxM2, Mistral, Qwen3, and Step3p5. These tests address various edge cases and potential issues such as delimiter preservation, skip_special_tokens logic, handling of zero-argument and malformed tool calls, Unicode character preservation, native tool call ID extraction, anyOf nullable parameter parsing, fast detokenization, and streaming behavior with multi-parameter and variable-sized chunks. A review comment highlights a malformed JSON string in a MinimaxM2 test, which needs to be corrected to ensure the test functions as intended.
Gemini miscounted the JSON braces inside the XML tags inside the Python strings of the test. I double-checked the test it highlighted just to be sure, then gave the review a thumbs-down for good measure, in case that feedback is used to improve training.
sfeng33 left a comment
LGTM, thank you for the thorough work!
This pull request has merge conflicts that must be resolved before it can be
@bbrowning there is a merge conflict here. Can you fix this?
Audited recent tool parser bug-fix PRs and found that several landed without corresponding test coverage. Added unit tests for each fix to prevent regressions.

- Mistral: fast detokenization text detection (PR vllm-project#37209)
- Qwen3Coder: malformed XML crash, anyOf double-encoding, speculative decode streaming (PRs vllm-project#36774, vllm-project#36032, vllm-project#35615)
- DeepSeekV32: delimiter preservation with fast detokenization, skip_special_tokens adjustment (PR vllm-project#33964)
- GLM-4 MoE: zero-argument tool calls, transformers 5.x delimiter handling, Unicode character preservation (PRs vllm-project#32321, vllm-project#31622, vllm-project#30920)
- MiniMax M2: anyOf nullable parameter handling for non-null and null values (PR vllm-project#32342)
- Step3p5: MTP-style variable-chunk and multi-token streaming (PR vllm-project#33690)
- Kimi K2: native tool call ID extraction and multi-turn ID continuity (PR vllm-project#32768)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ben Browning <bbrownin@redhat.com>
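As a rough illustration of the variable-chunk streaming cases listed above, a regression test of this kind can assert that the parsed tool calls are identical no matter how the model output is split into chunks. The sketch below is not taken from the PR: `ToyStreamingParser`, `parse_in_chunks`, the `<tool_call>` delimiters, and the sample output are all hypothetical stand-ins, and the real vLLM parsers have different interfaces.

```python
import json

# Hypothetical delimiters for this sketch; each real parser has its own format.
START, END = "<tool_call>", "</tool_call>"


class ToyStreamingParser:
    """Toy incremental parser, for illustration only (not vLLM's implementation)."""

    def __init__(self):
        self.buffer = ""
        self.calls = []

    def feed(self, chunk):
        """Append a chunk and extract every tool call that is now complete."""
        self.buffer += chunk
        while True:
            start = self.buffer.find(START)
            end = self.buffer.find(END, start + len(START)) if start != -1 else -1
            if start == -1 or end == -1:
                return  # wait for more text
            payload = self.buffer[start + len(START):end]
            self.calls.append(json.loads(payload))
            self.buffer = self.buffer[end + len(END):]


def parse_in_chunks(text, sizes):
    """Stream `text` through a fresh parser, cycling through the chunk sizes."""
    parser = ToyStreamingParser()
    pos, i = 0, 0
    while pos < len(text):
        size = sizes[i % len(sizes)]
        parser.feed(text[pos:pos + size])
        pos += size
        i += 1
    return parser.calls


# Made-up model output: one call with Unicode arguments, one with no arguments.
OUTPUT = (
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Zürich", "days": 3}}</tool_call>'
    '<tool_call>{"name": "ping", "arguments": {}}</tool_call>'
)

# Baseline: the whole output delivered as a single chunk.
EXPECTED = parse_in_chunks(OUTPUT, [len(OUTPUT)])

# Chunk boundaries that split delimiters and JSON must not change the result.
for sizes in ([1], [2, 7], [5, 1, 13], [64]):
    assert parse_in_chunks(OUTPUT, sizes) == EXPECTED
```

Comparing every chunking against the single-chunk baseline keeps the test independent of any particular expected value, so it only fails when chunk boundaries change the outcome.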
Rebasing this locally, it appears two of the tests that were previously passing are now failing. I'll investigate, but it looks like between when I opened this PR and now we may have regressed on two of these test cases already...
…llm-project#38189)

After the refactor in vllm-project#38189 to use self.tools instead of request.tools, the anyOf regression tests need to provide tools at parser construction time so the schema is available for type resolution.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Force-pushed from 6b6a09c to 386548d
OK, rebased and force-pushed with the conflict fix, and adjusted the tests for the new format where tools are passed when constructing the parser. That change is what caused my local failures after fixing the conflict, so it wasn't a regression in our parsers after all; I just needed to update these tests after PR 38189 landed.
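To make the "tools at construction time" point concrete, here is a minimal sketch of why an anyOf test needs the schema up front. Everything in it is an assumption for illustration: `ToyToolParser`, its `convert` method, and the `search` tool are hypothetical and do not mirror the actual vLLM parser classes or test fixtures.

```python
import json


class ToyToolParser:
    """Hypothetical parser that receives tool schemas when it is constructed."""

    def __init__(self, tools):
        # Map tool name -> parameter name -> JSON schema for that parameter.
        self.schemas = {
            tool["function"]["name"]: tool["function"]["parameters"]["properties"]
            for tool in tools
        }

    @staticmethod
    def _allowed_types(schema):
        """Collect the JSON types a parameter may take, flattening anyOf."""
        if "anyOf" in schema:
            return {option.get("type") for option in schema["anyOf"]}
        return {schema.get("type")}

    def convert(self, tool_name, param, raw):
        """Convert a raw string value using the schema stored at construction."""
        types = self._allowed_types(self.schemas[tool_name][param])
        if raw == "null" and "null" in types:
            return None  # nullable parameter given an explicit null
        if "string" in types:
            return raw  # keep strings as-is, avoiding double-encoding
        return json.loads(raw)


# Made-up tool definition with two nullable (anyOf) parameters.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "search",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"anyOf": [{"type": "string"}, {"type": "null"}]},
                "limit": {"anyOf": [{"type": "integer"}, {"type": "null"}]},
            },
        },
    },
}]

# The schema is supplied here, not per request, so type resolution can use it.
parser = ToyToolParser(tools=TOOLS)
query_value = parser.convert("search", "query", "vllm")
limit_value = parser.convert("search", "limit", "5")
null_value = parser.convert("search", "limit", "null")
```

With a parser built without tools, the toy `convert` would have no schema to consult, which is the same shape of failure described above: the tests themselves had to change, not the parsers.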
…ject#38172) Signed-off-by: Ben Browning <bbrownin@redhat.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
…ject#38172) Signed-off-by: Ben Browning <bbrownin@redhat.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> Signed-off-by: rishitdholakia13 <rishit+github@cohere.com>
…ject#38172) Signed-off-by: Ben Browning <bbrownin@redhat.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> Signed-off-by: Rishi Puri <riship@nvidia.com>
Purpose
Claude Code and I audited recent tool parser bug-fix PRs (Sept 2025 until now) and found that several landed without corresponding test coverage. This PR is purely additive test coverage to prevent regressions as we refactor, clean up, and redesign some of these areas.
Test Plan
Test Result
All the new tests passed, and all the old ones continue to pass.